Learning rate of students detecting and annotating pediatric wrist fractures in supervised artificial intelligence dataset preparations

The use of artificial intelligence (AI) in image analysis is an intensively debated topic in the radiology community. AI computer vision algorithms typically rely on large-scale image databases annotated by specialists. Developing and maintaining them is time-consuming; thus, the involvement of non-experts in the annotation workflow should be considered. We assessed the learning rate of inexperienced evaluators regarding correct labeling of pediatric wrist fractures on digital radiographs. Students with and without a medical background labeled wrist fractures with bounding boxes in 7,000 radiographs over ten days. Pediatric radiologists regularly discussed their mistakes with them. We found that F1 scores, as a measure of detection rate, increased substantially under specialist feedback (mean 0.61±0.19 at day 1 to 0.97±0.02 at day 10, p<0.001), whereas the Intersection over Union, as a parameter for labeling precision, improved less markedly (mean 0.27±0.29 at day 1 to 0.53±0.25 at day 10, p<0.001). The time needed to correct the students' annotations decreased significantly (mean 22.7±6.3 seconds per image at day 1 to 8.9±1.2 seconds at day 10, p<0.001) and was substantially lower than the time the radiologists needed to annotate the images alone. In conclusion, our data showed that involving undergraduate students in the annotation of pediatric wrist radiographs enables substantial time savings for specialists and should therefore be considered.


Introduction
The use of artificial intelligence (AI) for image analysis is one of the leading topics in the field of radiology [1][2][3][4]. Radiological AI models usually originate from annotated image data, an approach known as supervised AI [5] or supervised machine learning. With few exceptions, they fall into the domain of deep learning (DL) [6,7]. DL models commonly build upon large training image sets for robust outcomes [8,9], often containing thousands of samples or more, as in the case of ImageNet [10], Open Images [11], or Microsoft Common Objects in Context (COCO) [12]. Corresponding radiological datasets [13,14] are still rare, since building and maintaining comprehensive deep learning systems is still challenging [8].
For image annotation, a user may choose among a range of open-source and commercial software solutions with manual and (semi-)automatic labeling techniques [8,15,16,17,18]. However, they require area-specific expert information and development, and their implementation is often computationally intensive and time-consuming [8]. As the workload of radiologists has increased significantly in recent decades, mainly due to the growing number of time-consuming cross-sectional images [19], alternative solutions, such as involving a non-expert workforce in image annotation, might be reasonable. Medical students have demonstrated variable learning rates in other medical contexts, such as surgical skills or ultrasound [20][21][22][23][24][25]. To the best of our knowledge, no study has so far examined the involvement of students in reading or annotating radiographic examinations.
The goal of the current study was to estimate the learning rate of inexperienced evaluators in labeling pediatric wrist fractures on digital radiographs. We recruited students with and without a medical background or training to annotate fractures and thereby assessed their utility to radiologists in creating a comprehensive supervised deep learning dataset.

Methods
We recruited nine medical students and one high-school student to the study. We arranged them into four single raters and three teams of two evaluators. None of these ten individuals had specific experience in analyzing pediatric wrist fractures. Table 1 shows the particulars of these students, including previous experience in radiology or traumatology. They were instructed to manually tag all visible fractures of any age in randomly selected, non-overlapping pediatric wrist digital radiography (DR) studies. Each observer processed 1,000 images in batches of 100 images per workday over two weeks (10 business days).
Moreover, they were asked to annotate a list of additional image tags (laterality, image projection) and classes (text, metal, bone lesion, periosteal reaction, rotational axis, foreign bodies, and soft tissue swelling) in every image, where applicable. We also requested the raters to judge and note the subjective difficulty of every radiograph on a five-point Likert scale (1 = Very easy, 2 = Easy, 3 = Neither easy nor hard, 4 = Hard, 5 = Very hard). The 7,000 student-assessed trauma radiographs were part of a comprehensive, previously published dataset of pediatric trauma wrist examinations containing 20,327 images in total [26].
Professional reporting workstations equipped with calibrated radiological 10-bit gray-level monitors RX240, RX440, or RX650 (Eizo, Ishikawa, Japan) displayed the X-ray studies. Two pediatric radiologists with seven (S.T.) and eight (R.M.) years of professional experience in childhood trauma imaging re-evaluated the student interpretations, determining by consensus the numbers of true/false positive/negative fracture judgments. In cases where the reference radiologists were not able to ascertain the absence or presence of a fracture, they accepted the respective student classification as either true negative or true positive. The pediatric radiologists also recorded the time necessary to correct the erroneous annotations in the image sets. The long-term average labeling time per wrist image, including all previously mentioned classifications and objects, was 22 seconds for radiologist 1 and 21 seconds for radiologist 2. Each day, a pediatric radiologist gave constructive feedback to six of the seven raters to enable appropriate learning progress. Rater 7 (defined as control) received no specialist feedback during the labeling period, but only after completion of the annotation procedure.
Sensitivity (true positive rate = TPR), specificity (true negative rate = TNR), positive predictive value (PPV), negative predictive value (NPV), as well as the F1 score [= 2 × (TPR × PPV) / (TPR + PPV)] [27], were among the main parameters of interest, calculated based on the aforementioned true/false positive/negative fracture numbers. The Intersection over Union (IoU) metric (or Jaccard Index) served as a measure of bounding box accuracy [28], compared between the reference radiologists and the student-produced annotations. The literature commonly describes an overlapping area of more than 50% as good accordance between annotations by different raters [29]. A self-written Python script computed the IoU value in every image.
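The two main metrics described above can be illustrated with a minimal Python sketch. It is not the authors' script, which is not reproduced in the text. The function names and the box format (x_min, y_min, x_max, y_max) in pixel coordinates are assumptions made for illustration only.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * (TPR * PPV) / (TPR + PPV), computed from raw counts."""
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0  # sensitivity
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # positive predictive value
    if tpr + ppv == 0:
        return 0.0
    return 2 * (tpr * ppv) / (tpr + ppv)


def iou(box_a, box_b) -> float:
    """Intersection over Union (Jaccard index) of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical counts and boxes, not taken from the study data.
    print(f1_score(tp=8, fp=2, fn=2))
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

In this sketch, a student box that covers only half of the reference box while extending equally far beyond it yields an IoU of one third, which is below the 50% overlap commonly described as good accordance.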
We performed the statistical calculations with IBM SPSS Statistics version 21 (IBM, Armonk, New York, United States of America). The dataset was analyzed with descriptive statistics and comparisons of means, specifically t-tests and ANOVAs for group comparisons. Appropriate regression curves were fitted and selected to demonstrate learning rates or visualize progression over time. P values below 0.05 were assumed to be statistically significant.
The Ethics Committee of Medical University of Graz (IRB00002556) gave an affirmative vote for the retrospective data analyses (No. 31-108 ex 18/19), waiving the necessity to obtain informed consent.

Results
The reference radiologists diagnosed and labeled 6,072 fractures in 4,831 of 7,000 total wrist radiographs. The number of fractures ranged from 1 to a maximum of 3 per image. Students   Table 2.  Sensitivity (average of 0.83) decreased with higher difficulty ratings (ANOVA p<0.001). It was 0.83 for difficulty rating 1, 0.90 for rating 2, 0.82 for rating 3, 0.79 for rating 4, and 0.69 for rating 5. The raters perceived images with a cast as more difficult (mean difficulty rating 2.12 ±1.05 vs. 2.27 ±1.07, p<0.001). However, the number of errors did not differ significantly (p = 0.789).

Labeling precision
IoU increased significantly in all three groups over time (p<0.001), as depicted in Fig 2A. IoU mean values were 0.45 ±0.28 in teams, 0.48 ±0.28 in individual raters, and 0.31 ±0.24 in the control (ANOVA p<0.001). In the Bonferroni post hoc analysis, all groups were significantly different (p<0.001), with individual raters performing best.
We found that the IoU was significantly better in images with a present cast (0.49 ±0. The regression analysis revealed that IoU and F1 score were similarly influenced by patient age (Fig 2B). Image analysis was more challenging in the very young and in the oldest patients. However, the relation between F1 score and patient age was stronger (R² = 0.400) than that between IoU and patient age (R² = 0.266).

Annotation and correction times
Times required to annotate the images decreased over the study period, as shown in Fig 3A. Mean net annotation time was 21.8 ±9.7 seconds per image; 25.2 ±8.1 seconds in teams,

Discussion
The current study assessed the learning rates of students compared to board-certified pediatric radiologists in detecting and annotating childhood wrist fractures in the context of generating a supervised machine learning dataset.
The literature features only a few related studies on the learning rates of students in medical topics [21][22][23]. In the context of radiology, we found a few studies on ultrasound tasks [20,24] and emergency neuroimaging [25]. Our literature search did not find any comparable study on students annotating radiographs to detect pediatric fractures in the context of supervised AI workflows.
We saw marked learning progress in the raters receiving professional radiologist feedback. Some of the teams and individual raters were able to exceed an F1 score of 0.99, while none of them dropped below 0.95 on the last day. However, nobody attained an F1 score of 1.00 during the annotations. In contrast to the control, who did not get repeated feedback during the annotation process, all others achieved significantly higher scores from the second annotation day onward. Fig 4 gives examples of fractures often missed by the raters. Teams and individual raters did not exhibit relevant differences in learning rate and error patterns. Therefore, we assume that radiologists should prefer single non-expert annotators over teams with respect to responsible management of human resources. As we expected, the control demonstrated near-steady results over the study timeframe of ten days. Differences in difficulty between the datasets could be the cause of the perceptible daily variance. The radiologists also gave feedback to the control after the data acquisition so that the mistakes made could still serve as a learning opportunity. In the repeated feedback sessions, the reference radiologists systematically assessed all images of the prior rating together with the raters. The reasons for the mistakes made were discussed when possible.
Annotation times depended on the individual rater and showed substantial variance, as demonstrated in Fig 3A. Overall, there was a decrease in annotation time per image, approaching the radiologists' typical annotation duration of 21 seconds per image. A comparable established system is in common use worldwide, in which consultants sign the reports of their radiologists in training. In that setting, the student and radiologist annotation times together increased to about 30 seconds per image. As compensation, the non-experts benefited by receiving feedback to achieve learning success. More importantly, the correction times for the experts decreased steadily (Fig 3B), which led to a correction time per image of about 10 seconds at day ten. This reduction means considerable time savings for the experts and could approximately double their annotation throughput, a major bottleneck.
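The throughput estimate above follows from simple arithmetic, sketched here in Python. The function names are hypothetical, and the inputs are the rounded figures quoted in the text (about 21 seconds for a radiologist annotating alone, about 10 seconds for correcting a student annotation at day ten). The sketch ignores the students' own annotation time and any time spent on feedback sessions.

```python
def images_per_hour(seconds_per_image: float) -> float:
    """Number of images one expert can process per hour."""
    if seconds_per_image <= 0:
        raise ValueError("seconds_per_image must be positive")
    return 3600.0 / seconds_per_image


def throughput_gain(annotate_s: float, correct_s: float) -> float:
    """Ratio of expert throughput when correcting vs. annotating from scratch."""
    return images_per_hour(correct_s) / images_per_hour(annotate_s)


if __name__ == "__main__":
    # Rounded values quoted in the text; treated here as illustrative inputs.
    radiologist_alone_s = 21.0
    correction_day10_s = 10.0
    gain = throughput_gain(radiologist_alone_s, correction_day10_s)
    print(f"Expert throughput gain: {gain:.1f}x")  # roughly 2.1x
```

Under these assumptions, the expert's time per image roughly halves, which corresponds to the approximate doubling of throughput stated above.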
The study results imply that it was easy for students to learn to recognize fractures, whereas none of the raters managed to grasp the full extent of many bone injuries within the study duration. While F1 scores (surrogate parameter for fracture recognition) increased substantially, we only saw a small increase in IoU (labeling precision) over the days. This discrepancy implies that the recognition of smaller details in the images was more challenging, i.e., even when fractures were recognized correctly, the students could not reproduce their actual extent in many cases. The results of this study regarding learning performance in fracture detection may not be directly transferable to other body regions or other specific tasks. Further studies in this area appear warranted.
Surprisingly, patient age clearly influenced the number of errors and the scores, as depicted in Fig 2B. The F1 score and the IoU declined in teenagers and newborns, with a plateau between approximately one and ten years. Our experience indicates that the fusing growth plates of the distal radius and ulna in teenagers (compare Fig 5) hinder correct annotation to a certain degree. In addition, subtle fractures of the ulnar styloid process and the carpal bones were diagnosed and missed more commonly in teenagers.
Several authors proposed deep-learning algorithms to enhance the speed of image annotation by professionals, which is one significant bottleneck [8,30,31,32]. We hypothesized that, depending on the complexity and difficulty of the labeling task, the help of inexperienced annotators accelerates the marking process. Other methods are available, such as training a neural network on a small subsample and then applying it to the rest [33]. This approach is known as the "Human-in-the-loop" (HITL) method and is used in many fields of artificial intelligence, including computer vision [34][35][36]. HITL is an alternative to the approach in this manuscript, which uses non-specialists to relieve experts of workload when creating supervised DL datasets. It remains undecided which of the mentioned techniques is superior to the others.
Some limitations need to be reported and discussed. The observers faced randomly chosen datasets without overlapping examinations. That implies a certain amount of variability in the difficulty of solving them correctly. A specific study set might have been more straightforward than another. Daily rates of true and false ratings may be affected in both directions by an "easier" or "harder" selection of studies in combination with a "lucky" or "unlucky" rater. To minimize the resulting selection bias, we decided to present the students with a substantial number of 100 images per day. Also, the reference radiologists' condition on a particular day may influence the fracture assessment. We tried to overcome that type of interference by accepting an index rating as correct if both reference radiologists were uncertain about a diagnosis. Readers should also keep in mind the well-known, method-inherent limitation of reduced fracture detection sensitivity in plain radiographs. Another drawback is that we did not assess parameters other than fractures in greater detail, such as bounding boxes containing text and metal, as there was a low rate of error and little relevance for the project goals. Transcription errors during the correction phases are conceivable and may have occurred occasionally. However, their influence should be negligible in our comprehensive dataset.

Conclusion
In conclusion, students can help detect and label pediatric fractures around the wrist, assisting radiologists in building a supervised artificial intelligence dataset. While the error rate in fracture recognition decreased quickly under feedback, bounding box precision did not improve as much. However, after a few days of instruction, substantial time savings for the specialists are possible. Our data showed no relevant benefit of employing teams over individual non-expert raters in that setting.

Supporting information
S1 Data. (XLSX)